Platform Explorer / Nuxeo Platform 2021.65

Bundle org.nuxeo.importer.stream

In bundle group org.nuxeo.ecm.platform

Documentation

  • README.md

    nuxeo-importer-stream

    About

    This module defines a producer/consumer pattern and uses the Log features provided by Nuxeo Stream.

    Producer/Consumer pattern with automation operations

    The Log is used to perform mass import.

    It decouples the Extraction/Transformation from the Load (using the ETL terminology).

    The extraction and transformation are done by a document message producer with custom logic.

    This module comes with a random document generator and a random blob generator, which do the same job as the random importer of the nuxeo-importer-core module.

    The load into Nuxeo is done with a generic consumer.

    Automation operations are exposed to run producers and consumers.

    Two-step import: Generate and Import documents with blobs

    1. Run random producers of document messages. These messages represent Folder and File documents with a blob. The total number of documents created is nbThreads * nbDocuments.
    curl -X POST 'http://localhost:8080/nuxeo/site/automation/StreamImporter.runRandomDocumentProducers' -u Administrator:Administrator -H 'content-type: application/json' \
    -d '{"params":{"nbDocuments": 100, "nbThreads": 5}}'
    
    | Params | Default | Description |
    | --- | --- | --- |
    | nbDocuments | | The number of documents to generate per producer thread |
    | nbThreads | 8 | The number of concurrent producers to run |
    | avgBlobSizeKB | 1 | The average blob size of each File document in KB. If set to 0, File documents are created without a blob. |
    | lang | en_US | The locale used for the generated content, can be fr_FR or en_US |
    | logName | import/doc | The name of the Log. |
    | logSize | $nbThreads | The number of partitions in the Log, which fixes the maximum number of consumer threads |
    | logBlobInfo | | A Log name containing blob information to use, see section below for use case |
    2. Run consumers of document messages to create Nuxeo documents. The concurrency will match the nbThreads parameter of the previous producers.
    curl -X POST 'http://localhost:8080/nuxeo/site/automation/StreamImporter.runDocumentConsumers' -u Administrator:Administrator -H 'content-type: application/json' \
    -d '{"params":{"rootFolder": "/default-domain/workspaces"}}'
    
    | Params | Default | Description |
    | --- | --- | --- |
    | rootFolder | | The path of the Nuxeo container to import documents into; this document must exist |
    | repositoryName | | The repository name used to import documents |
    | nbThreads | logSize | The number of concurrent consumers, should not be greater than the number of partitions in the Log |
    | batchSize | 10 | The consumer commits documents every batchSize documents |
    | batchThresholdS | 20 | The consumer commits documents if the transaction is longer than this threshold |
    | retryMax | 3 | Number of times a consumer retries the import in case of failure |
    | retryDelayS | 2 | Delay between retries |
    | logName | import/doc | The name of the Log to tail |
    | useBulkMode | false | Process asynchronous listeners in bulk mode |
    | blockIndexing | false | Do not index created documents with Elasticsearch |
    | blockAsyncListeners | false | Do not process any asynchronous listeners |
    | blockPostCommitListeners | false | Do not process any post commit listeners |
    | blockDefaultSyncListeners | false | Disable some default synchronous listeners: dublincore, mimetype, notification, template, binarymetadata and uid |
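
The two calls above can also be prepared from a script. The following is a minimal sketch, not part of the module: it reuses the URL, credentials and parameters from the curl examples, while the helper names (`build_request`, `total_documents`) are hypothetical. It only builds the requests, so it runs without a Nuxeo server.

```python
import base64
import json
import urllib.request

# Values copied from the curl examples above; adjust for a real server.
BASE_URL = "http://localhost:8080/nuxeo/site/automation"
CREDENTIALS = "Administrator:Administrator"


def build_request(operation, params):
    """Build (but do not send) the POST request for one automation operation."""
    body = json.dumps({"params": params}).encode("utf-8")
    token = base64.b64encode(CREDENTIALS.encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"{BASE_URL}/{operation}",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


def total_documents(nb_threads, nb_documents):
    """Each producer thread generates nb_documents, so the total is the product."""
    return nb_threads * nb_documents


producer_params = {"nbDocuments": 100, "nbThreads": 5}
consumer_params = {"rootFolder": "/default-domain/workspaces"}

two_step_requests = [
    build_request("StreamImporter.runRandomDocumentProducers", producer_params),
    build_request("StreamImporter.runDocumentConsumers", consumer_params),
]

# logSize defaults to nbThreads, which caps the number of consumer threads.
max_consumer_threads = producer_params.get("logSize", producer_params["nbThreads"])

print(total_documents(producer_params["nbThreads"], producer_params["nbDocuments"]))
print(max_consumer_threads)
```

Passing each prepared request to `urllib.request.urlopen` would send it, in the same order as the curl calls.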

    Four-step import: Generate and Import blobs, then Generate and Import documents

    1. Run producers of random blob messages
    curl -X POST 'http://localhost:8080/nuxeo/site/automation/StreamImporter.runRandomBlobProducers' -u Administrator:Administrator -H 'content-type: application/json' \
    -d '{"params":{"nbBlobs": 100, "nbThreads": 5}}'
    
    | Params | Default | Description |
    | --- | --- | --- |
    | nbBlobs | | The number of blobs to generate per producer thread |
    | nbThreads | 8 | The number of concurrent producers to run |
    | avgBlobSizeKB | 1 | The average blob size of each File document in KB |
    | lang | en_US | The locale used for the generated content, can be "fr_FR" or "en_US" |
    | logName | import/blob | The name of the Log to append blobs to. |
    | logSize | $nbThreads | The number of partitions in the Log, which fixes the maximum number of consumer threads |
    2. Run consumers of blob messages importing into the Nuxeo binary store, saving blob information into a new Log.
    curl -X POST 'http://localhost:8080/nuxeo/site/automation/StreamImporter.runBlobConsumers' -u Administrator:Administrator -H 'content-type: application/json' \
    -d '{"params":{"blobProviderName": "default", "logBlobInfo": "blob-info"}}'
    
    | Params | Default | Description |
    | --- | --- | --- |
    | blobProviderName | default | The name of the binary store blob provider |
    | logName | import/blob | The name of the Log that contains the blobs |
    | logBlobInfo | import/blob-info | The name of the Log to append blob information about imported blobs |
    | nbThreads | $logSize | The number of concurrent consumers, should not be greater than the number of partitions in the Log |
    | retryMax | 3 | Number of times a consumer retries the import in case of failure |
    | retryDelayS | 2 | Delay between retries |
    3. Run producers of random Nuxeo document messages which use the blobs created in step 2
    curl -X POST 'http://localhost:8080/nuxeo/site/automation/StreamImporter.runRandomDocumentProducers' -u Administrator:Administrator -H 'content-type: application/json' \
    -d '{"params":{"nbDocuments": 200, "nbThreads": 5, "logBlobInfo": "blob-info"}}'
    

    The parameters are the same as in the previous runRandomDocumentProducers call; here we also set the logBlobInfo parameter.

    4. Run consumers of document messages
    curl -X POST 'http://localhost:8080/nuxeo/site/automation/StreamImporter.runDocumentConsumers' -u Administrator:Administrator -H 'content-type: application/json' \
    -d '{"params":{"rootFolder": "/default-domain/workspaces"}}'
    

    The parameters are the same as in the previous runDocumentConsumers call.
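
The four steps are chained through Log names: the blob consumers write blob information into a Log, and the document producers must read from that same Log. A minimal sketch of that check follows, with the operations and parameters copied from the curl calls above; the helper `blob_info_logs_match` is hypothetical and not part of the module.

```python
# Ordered (operation, params) pairs copied from the four curl calls above.
STEPS = [
    ("StreamImporter.runRandomBlobProducers",
     {"nbBlobs": 100, "nbThreads": 5}),
    ("StreamImporter.runBlobConsumers",
     {"blobProviderName": "default", "logBlobInfo": "blob-info"}),
    ("StreamImporter.runRandomDocumentProducers",
     {"nbDocuments": 200, "nbThreads": 5, "logBlobInfo": "blob-info"}),
    ("StreamImporter.runDocumentConsumers",
     {"rootFolder": "/default-domain/workspaces"}),
]


def blob_info_logs_match(steps):
    """Return True when the document producers read the Log the blob consumers wrote."""
    by_operation = dict(steps)
    # Per the parameter table, the blob consumers default to import/blob-info.
    written = by_operation["StreamImporter.runBlobConsumers"].get(
        "logBlobInfo", "import/blob-info")
    # The document producers have no documented default for logBlobInfo.
    read = by_operation["StreamImporter.runRandomDocumentProducers"].get("logBlobInfo")
    return written == read


blob_params = STEPS[0][1]
doc_params = STEPS[2][1]
print(blob_info_logs_match(STEPS))
print(blob_params["nbThreads"] * blob_params["nbBlobs"])      # blobs generated
print(doc_params["nbThreads"] * doc_params["nbDocuments"])    # documents generated
```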

    Create blobs using existing files

    Create a file containing the list of files to import then:

    1. Generate blob messages corresponding to the files, dispatch the messages into 4 partitions:
    curl -X POST 'http://localhost:8080/nuxeo/site/automation/StreamImporter.runFileBlobProducers' -u Administrator:Administrator -H 'content-type: application/json' \
    -d '{"params":{"listFile": "/tmp/my-file-list.txt", "logSize": 4}}'
    
    | Params | Default | Description |
    | --- | --- | --- |
    | listFile | | The path to the listing file |
    | basePath | '' | The base path to use as prefix of each file listed in the listFile |
    | nbBlobs | 0 | The number of blobs to generate per producer thread, 0 means all entries, loop on listFile entries if necessary |
    | nbThreads | 1 | The number of concurrent producers to run |
    | logName | import/blob | The name of the Log to append blobs to. |
    | logSize | $nbThreads | The number of partitions in the Log, which fixes the maximum number of consumer threads |

    Then you can use the 3 other steps described in the section above to import blobs with 4 threads and create documents.
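
The listing file itself has to be created beforehand. The sketch below shows one way to generate it, assuming (this is not stated above) that the file holds one path per line and that paths are written relative to the directory later passed as basePath. The function name `write_list_file` and the sample file names are hypothetical.

```python
import tempfile
from pathlib import Path


def write_list_file(base_path, list_file):
    """Write one path per line, relative to base_path, and return the entry count.

    Assumption: one entry per line; the expected format is not described above.
    """
    base = Path(base_path)
    entries = sorted(
        p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_file()
    )
    Path(list_file).write_text("\n".join(entries) + "\n", encoding="utf-8")
    return len(entries)


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp, "files")
    (root / "sub").mkdir(parents=True)
    (root / "a.txt").write_text("hello", encoding="utf-8")
    (root / "sub" / "b.jpg").write_bytes(b"")

    list_file = Path(tmp, "my-file-list.txt")
    count = write_list_file(root, list_file)
    listing = list_file.read_text(encoding="utf-8").splitlines()

    # Parameters for StreamImporter.runFileBlobProducers, as in the curl call above.
    params = {"listFile": str(list_file), "basePath": str(root), "logSize": 4}

print(count, listing)
```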

    Note that the type of document is adapted to the detected MIME type of the file, so that:

    • an image file generates a Picture document
    • a video file generates a Video document
    • any other type is translated to a File document
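
The mapping above can be sketched as a small function. This is an illustration only: it guesses the MIME type from the file extension with Python's `mimetypes` module, which is an assumption and may differ from how the importer detects the type. The function name `document_type_for` is hypothetical.

```python
import mimetypes


def document_type_for(filename):
    """Map a file name to the document type described above."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        return "File"  # unknown types fall back to File
    if mime_type.startswith("image/"):
        return "Picture"
    if mime_type.startswith("video/"):
        return "Video"
    return "File"


for name in ("photo.jpg", "clip.mp4", "report.pdf"):
    print(name, document_type_for(name))
```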

    Generate random files for testing purposes

    For testing purposes it can be handy to generate different files from an existing one; the goal is to generate lots of unique files from a limited set of files.

    To do this, first generate blob messages pointing to the files (see the previous section) and choose an nbBlobs value corresponding to the expected number of blobs to import (use a number greater than the number of existing files).
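
When nbBlobs is greater than the number of listed files, the producer loops over the listFile entries. A minimal sketch of that selection rule for a single producer thread follows; the function name `select_entries` is hypothetical, and how several threads share the list is not covered here.

```python
from itertools import cycle, islice


def select_entries(entries, nb_blobs):
    """Return the entries one producer thread would emit.

    nb_blobs == 0 means all entries once; otherwise loop over the
    entries until nb_blobs items have been produced.
    """
    if nb_blobs == 0:
        return list(entries)
    return list(islice(cycle(entries), nb_blobs))


files = ["a.txt", "b.jpg", "c.mp4"]
print(select_entries(files, 5))
print(select_entries(files, 0))
```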

    The next step is to add a special option to the blob consumers so that, instead of importing the existing file as is, a watermark is added to the blob before importing it.

    1. Run consumers of blob messages, adding a watermark to each file and importing it into the Nuxeo binary store, saving blob information into a new Log.
    curl -X POST 'http://localhost:8080/nuxeo/site/automation/StreamImporter.runBlobConsumers' -u Administrator:Administrator -H 'content-type: application/json' \
    -d '{"params":{"watermark": "foo"}}'
    

    The additional parameters are:

    | Params | Default | Description |
    | --- | --- | --- |
    | watermark | | Ask to add a watermark to the file before importing it, using the provided string if possible. |
    | persistBlobPath | | Use a path if you want to keep the generated files on disk |
    | blobProviderName | default | If blank there is no Nuxeo blob import, this can be useful for import with Gatling/Redis |

    Continue with other steps described above to generate and create documents.

    Note that only a few MIME types are supported for watermarking so far:

    • text/plain: Insert a unique tag at the beginning of the text.
    • image/jpeg: Set the EXIF software tag to a unique tag.
    • video/mp4: Set the title to the unique tag.
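
To illustrate the text/plain case, here is a minimal sketch of prepending a unique tag so that each copy of the same source file becomes a distinct blob. The tag format (the watermark string followed by a random UUID on its own line) is an assumption made for this example, not the format used by the module, and `watermark_text` is a hypothetical name.

```python
import uuid


def watermark_text(content, watermark):
    """Return (watermarked content, tag) with a unique tag inserted at the start."""
    # Assumed tag format: "<watermark> <random uuid>".
    tag = f"{watermark} {uuid.uuid4()}"
    return f"{tag}\n{content}", tag


original = "Some text file content\n"
first, first_tag = watermark_text(original, "foo")
second, second_tag = watermark_text(original, "foo")

# Same source, two different blobs.
print(first != second)
```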

    Import document using REST API via Gatling/Redis

    Instead of doing a mass import that creates documents in batches with the efficient internal API, you can save them into Redis in a form usable by a Gatling simulation; this way you can stress the REST API.

    To do this, instead of the document creation in step 4, do:

    1. Run Redis consumers of document messages
    curl -X POST 'http://localhost:8080/nuxeo/site/automation/StreamImporter.runRedisDocumentConsumers' -u Administrator:Administrator -H 'content-type: application/json' \
    -d '{"params":{"rootFolder": "/default-domain/workspaces"}}'
    

    Note that Nuxeo must be configured with Redis (nuxeo.redis.enabled=true).

    After this you need to use simulations in nuxeo-distribution/nuxeo-jsf-ui-gatling-tests/:

    # init the infra, creating a group of test users and a workspace
    mvn -nsu gatling:test -Dgatling.simulationClass=org.nuxeo.cap.bench.Sim00Setup -Pbench -DredisDb=0 -Durl=http://localhost:8080/nuxeo
    
    # import the folder structure
    mvn -nsu gatling:test -Dgatling.simulationClass=org.nuxeo.cap.bench.Sim10CreateFolders -Pbench -DredisDb=0 -Durl=http://localhost:8080/nuxeo
    
    # import the documents using 8 concurrent users
    mvn -nsu gatling:test -Dgatling.simulationClass=org.nuxeo.cap.bench.Sim20CreateDocuments -Pbench -DredisDb=0 -Dusers=8 -Durl=http://localhost:8080/nuxeo
    
    

    The node running the Gatling simulation must have access to the files to import.

    Here is an overview of possible usage to generate mass import and load tests with the stream importer:

    [Figure: import diagram]

    Visit nuxeo-jsf-ui-gatling for more information.

    Building

    To build and run the tests, simply start the Maven build:

    mvn clean install
    

    About Nuxeo

    Nuxeo dramatically improves how content-based applications are built, managed and deployed, making customers more agile, innovative and successful. Nuxeo provides a next generation, enterprise ready platform for building traditional and cutting-edge content oriented applications. Combining a powerful application development environment with SaaS-based tools and a modular architecture, the Nuxeo Platform and Products provide clear business value to some of the most recognizable brands including Verizon, Electronic Arts, Sharp, FICO, the U.S. Navy, and Boeing. Nuxeo is headquartered in New York and Paris. More information is available at www.nuxeo.com.

  • Parent Documentation: README.md

    Nuxeo Platform Importer

    About Nuxeo Platform Importer

    The file importer comes as a Java library (with a Nuxeo Runtime service) and a sample JAX-RS interface to launch, monitor and abort import jobs. This is an ongoing project, supported by Nuxeo.

    Building

    How to Build Nuxeo Platform Importer

    Build the Nuxeo Platform Importer with Maven:

    $ mvn install -Dmaven.test.skip=true

    Deploying

    Nuxeo Platform Importer is available as two package add-ons from the Nuxeo Marketplace:

    • https://connect.nuxeo.com/nuxeo/site/marketplace/package/nuxeo-platform-importer
    • https://connect.nuxeo.com/nuxeo/site/marketplace/package/nuxeo-scan-importer

    Resources

    Documentation

    The documentation for Nuxeo Platform Importer is available in our Documentation Center: http://doc.nuxeo.com/x/gYBVAQ

    Reporting Issues

    You can follow the developments in the Nuxeo Platform project of our JIRA bug tracker, which includes a Nuxeo Platform Importer component: https://jira.nuxeo.com/browse/NXP/component/10621

    You can report issues on: http://answers.nuxeo.com/

Resolution Order

218
The resolution order represents the order in which this bundle's single component has been resolved by the Nuxeo Runtime framework.
You can influence this order by adding "require" tags in the component declaration, to make sure it is resolved after another component. It will also impact the order in which contributions are registered on their target extension point (see "Registration Order" on contributions).

Components

Packages

Maven Artifact

File: nuxeo-importer-stream-2021.65.6.jar
Group Id: org.nuxeo.ecm.platform
Artifact Id: nuxeo-importer-stream
Version: 2021.65.6

Manifest

Manifest-Version: 1.0
Archiver-Version: Plexus Archiver
Created-By: Apache Maven
Built-By: root
Build-Jdk: 11.0.25
Bundle-ManifestVersion: 1
Bundle-Version: 2021.65.6-t20241231-040318
Bundle-SymbolicName: org.nuxeo.importer.stream;singleton:=true
Bundle-Name: Nuxeo Importer Stream
Bundle-Vendor: Nuxeo
Nuxeo-Component: OSGI-INF/operations-contrib.xml

Exports
